Chapter 2 - Regression and model validation

I have created an appropriate analysis dataset and excluded unwanted observations. In this chapter the dataset and its variables are analysed. The work is described assuming the reader has an introductory-course-level understanding of reading and writing R code and of statistical methods, but no previous knowledge of the data or of the more advanced methods used.

For these exercises the libraries "dplyr", "ggplot2" and "GGally" are necessary; they need to be installed and loaded before running the code. I have also included cache = F in the chunk options, since without it my version of R refused to knit the plots I made.

  1. First we read the dataset I created into R and explore its contents, dimensions and structure
 library("dplyr")
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library("ggplot2")
library("GGally")
## 
## Attaching package: 'GGally'
## The following object is masked from 'package:dplyr':
## 
##     nasa
library("lattice")
Students2014 <- read.table("~/Documents/IODS-project/data/learning2014/learning2014.txt", header = TRUE, sep = " ")
print(Students2014)
##     gender Age Attitude     deep  stra     surf Points
## 1        F  53       37 3.583333 3.375 2.583333     25
## 2        M  55       31 2.916667 2.750 3.166667     12
## 3        F  49       25 3.500000 3.625 2.250000     24
## 4        M  53       35 3.500000 3.125 2.250000     10
## 5        M  49       37 3.666667 3.625 2.833333     22
## 6        F  38       38 4.750000 3.625 2.416667     21
## 7        M  50       35 3.833333 2.250 1.916667     21
## 8        F  37       29 3.250000 4.000 2.833333     31
## 9        M  37       38 4.333333 4.250 2.166667     24
## 10       F  42       21 4.000000 3.500 3.000000     26
## 11       M  37       39 3.583333 3.625 2.666667     31
## 12       F  34       38 3.833333 4.750 2.416667     31
## 13       F  34       24 4.250000 3.625 2.250000     23
## 14       F  34       30 3.333333 3.500 2.750000     25
## 15       M  35       26 4.166667 1.750 2.333333     21
## 16       F  33       41 3.666667 3.875 2.333333     31
## 17       F  32       26 4.083333 1.375 2.916667     20
## 18       F  44       26 3.500000 3.250 2.500000     22
## 19       M  29       17 4.083333 3.000 3.750000      9
## 20       F  30       27 4.000000 3.750 2.750000     24
## 21       M  27       39 3.916667 2.625 2.333333     28
## 22       M  29       34 4.000000 2.375 2.416667     30
## 23       F  31       27 4.000000 3.625 3.000000     24
## 24       F  37       23 3.666667 2.750 2.416667      9
## 25       F  26       37 3.666667 1.750 2.833333     26
## 26       F  26       44 4.416667 3.250 3.166667     32
## 27       M  30       41 3.916667 4.000 3.000000     32
## 28       F  33       37 3.750000 3.625 2.000000     33
## 29       F  33       25 3.250000 2.875 3.500000     29
## 30       M  28       30 3.583333 3.000 3.750000     30
## 31       M  26       34 4.916667 1.625 2.500000     19
## 32       F  27       32 3.583333 3.250 2.083333     23
## 33       F  25       20 2.916667 3.500 2.416667     19
## 34       F  31       24 3.666667 3.000 2.583333     12
## 35       M  20       42 4.500000 3.250 1.583333     10
## 36       F  39       16 4.083333 1.875 2.833333     11
## 37       M  38       31 3.833333 4.375 1.833333     20
## 38       M  24       38 3.250000 3.625 2.416667     26
## 39       M  26       38 2.333333 2.500 3.250000     31
## 40       M  25       33 3.333333 1.250 3.416667     20
## 41       F  30       17 4.083333 4.000 3.416667     23
## 42       F  25       25 2.916667 3.000 3.166667     12
## 43       M  30       32 3.333333 2.500 3.500000     24
## 44       F  48       35 3.833333 4.875 2.666667     17
## 45       F  24       32 3.666667 5.000 2.416667     29
## 46       F  40       42 4.666667 4.375 3.583333     23
## 47       M  25       31 3.750000 3.250 2.083333     28
## 48       F  23       39 3.416667 4.000 3.750000     31
## 49       F  25       19 4.166667 3.125 2.916667     23
## 50       F  23       21 2.916667 2.500 2.916667     25
## 51       M  27       25 4.166667 3.125 2.416667     18
## 52       M  25       32 3.583333 3.250 3.000000     19
## 53       M  23       32 2.833333 2.125 3.416667     22
## 54       F  23       26 4.000000 2.750 2.916667     25
## 55       F  23       23 2.916667 2.375 3.250000     21
## 56       F  45       38 3.000000 3.125 3.250000      9
## 57       F  22       28 4.083333 4.000 2.333333     28
## 58       F  23       33 2.916667 4.000 3.250000     25
## 59       M  21       48 3.500000 2.250 2.500000     29
## 60       M  21       40 4.333333 3.250 1.750000     33
## 61       F  21       40 4.250000 3.625 2.250000     33
## 62       F  21       47 3.416667 3.625 2.083333     25
## 63       F  26       23 3.083333 2.500 2.833333     18
## 64       F  25       31 4.583333 1.875 2.833333     22
## 65       F  26       27 3.416667 2.000 2.416667     17
## 66       M  21       41 3.416667 1.875 2.250000     25
## 67       F  23       34 3.416667 4.000 2.833333     28
## 68       F  22       25 3.583333 2.875 2.250000     22
## 69       F  22       21 1.583333 3.875 1.833333     26
## 70       F  22       14 3.333333 2.500 2.916667     11
## 71       F  23       19 4.333333 2.750 2.916667     29
## 72       M  22       37 4.416667 4.500 2.083333     22
## 73       M  23       32 4.833333 3.375 2.333333     21
## 74       M  24       28 3.083333 2.625 2.416667     28
## 75       F  22       41 3.000000 4.125 2.750000     33
## 76       F  23       25 4.083333 2.625 3.250000     16
## 77       M  22       28 4.083333 2.250 1.750000     31
## 78       M  20       38 3.750000 2.750 2.583333     22
## 79       M  22       31 3.083333 3.000 3.333333     31
## 80       M  21       35 4.750000 1.625 2.833333     23
## 81       F  22       36 4.250000 1.875 2.500000     26
## 82       F  23       26 4.166667 3.375 2.416667     12
## 83       M  21       44 4.416667 3.750 2.416667     26
## 84       M  22       45 3.833333 2.125 2.583333     31
## 85       M  29       32 3.333333 2.375 3.000000     19
## 86       F  29       39 3.166667 2.750 2.000000     30
## 87       F  21       25 3.166667 3.125 3.416667     12
## 88       M  28       33 3.833333 3.500 2.833333     17
## 89       F  21       33 4.250000 2.625 2.250000     18
## 90       F  30       30 3.833333 3.375 2.750000     19
## 91       F  21       29 3.666667 2.250 3.916667     21
## 92       M  23       33 3.833333 3.000 2.333333     24
## 93       F  21       33 3.833333 4.000 2.750000     28
## 94       F  21       35 3.833333 3.500 2.750000     17
## 95       F  20       36 3.666667 2.625 2.916667     18
## 96       M  22       37 4.333333 2.500 2.083333     17
## 97       M  21       42 3.750000 3.750 3.666667     23
## 98       M  21       32 4.166667 3.625 2.833333     26
## 99       F  20       50 4.000000 4.125 3.416667     28
## 100      M  22       47 4.000000 4.375 1.583333     31
## 101      F  20       36 4.583333 2.625 2.916667     27
## 102      F  20       36 3.666667 4.000 3.000000     25
## 103      M  24       29 3.666667 2.750 2.916667     23
## 104      F  20       35 3.833333 2.750 2.666667     21
## 105      F  19       40 2.583333 1.375 3.000000     27
## 106      F  21       35 3.500000 2.250 2.750000     28
## 107      F  21       32 3.083333 3.625 3.083333     23
## 108      F  22       26 4.250000 3.750 2.500000     21
## 109      F  25       20 3.166667 4.000 2.333333     25
## 110      F  21       27 3.083333 3.125 3.000000     11
## 111      F  22       32 4.166667 3.250 3.000000     19
## 112      F  25       33 2.250000 2.125 4.000000     24
## 113      F  20       39 3.333333 2.875 3.250000     28
## 114      M  24       33 3.083333 1.500 3.500000     21
## 115      F  20       30 2.750000 2.500 3.500000     24
## 116      M  21       37 3.250000 3.250 3.833333     24
## 117      F  20       25 4.000000 3.625 2.916667     20
## 118      F  20       29 3.583333 3.875 2.166667     19
## 119      M  31       39 4.083333 3.875 1.666667     30
## 120      F  20       36 4.250000 2.375 2.083333     22
## 121      F  22       29 3.416667 3.000 2.833333     16
## 122      F  22       21 3.083333 3.375 3.416667     16
## 123      M  21       31 3.500000 2.750 3.333333     19
## 124      M  22       40 3.666667 4.500 2.583333     30
## 125      F  21       31 4.250000 2.625 2.833333     23
## 126      F  21       23 4.250000 2.750 3.333333     19
## 127      F  21       28 3.833333 3.250 3.000000     18
## 128      F  21       37 4.416667 4.125 2.583333     28
## 129      F  20       26 3.500000 3.375 2.416667     21
## 130      F  21       24 3.583333 2.750 3.583333     19
## 131      F  25       30 3.666667 4.125 2.083333     27
## 132      M  21       28 2.083333 3.250 4.333333     24
## 133      F  24       29 4.250000 2.875 2.666667     21
## 134      F  20       24 3.583333 2.875 3.000000     20
## 135      M  21       31 4.000000 2.375 2.666667     28
## 136      F  20       19 3.333333 3.875 2.166667     12
## 137      F  20       20 3.500000 2.125 2.666667     21
## 138      F  18       38 3.166667 4.000 2.250000     28
## 139      F  21       34 3.583333 3.250 2.666667     31
## 140      F  19       37 3.416667 2.625 3.333333     18
## 141      F  21       29 4.250000 2.750 3.500000     25
## 142      F  20       23 3.250000 4.000 2.750000     19
## 143      M  21       41 4.416667 3.000 2.000000     21
## 144      F  20       27 3.250000 3.375 2.833333     16
## 145      F  21       35 3.916667 3.875 3.500000      7
## 146      F  20       34 3.583333 3.250 2.500000     21
## 147      F  18       32 4.500000 3.375 3.166667     17
## 148      M  22       33 3.583333 4.125 3.083333     22
## 149      F  22       33 3.666667 3.500 2.916667     18
## 150      M  24       35 2.583333 2.000 3.166667     25
## 151      F  19       32 4.166667 3.625 2.500000     24
## 152      F  20       31 3.250000 3.375 3.833333     23
## 153      F  20       28 4.333333 2.125 2.250000     23
## 154      F  17       17 3.916667 4.625 3.416667     26
## 155      M  19       19 2.666667 2.500 3.750000     12
## 156      F  20       35 3.083333 2.875 3.000000     32
## 157      F  20       24 3.750000 2.750 2.583333     22
## 158      F  20       21 4.166667 4.000 3.333333     20
## 159      F  20       29 4.166667 2.375 2.833333     21
## 160      F  19       19 3.250000 3.875 3.000000     23
## 161      F  19       20 4.083333 3.375 2.833333     20
## 162      F  22       42 2.916667 1.750 3.166667     28
## 163      M  35       41 3.833333 3.000 2.750000     31
## 164      F  18       37 3.166667 2.625 3.416667     18
## 165      F  19       36 3.416667 2.625 3.000000     30
## 166      M  21       18 4.083333 3.375 2.666667     19
#2.
dim(Students2014)
## [1] 166   7
str(Students2014)
## 'data.frame':    166 obs. of  7 variables:
##  $ gender  : Factor w/ 2 levels "F","M": 1 2 1 2 2 1 2 1 2 1 ...
##  $ Age     : int  53 55 49 53 49 38 50 37 37 42 ...
##  $ Attitude: int  37 31 25 35 37 38 35 29 38 21 ...
##  $ deep    : num  3.58 2.92 3.5 3.5 3.67 ...
##  $ stra    : num  3.38 2.75 3.62 3.12 3.62 ...
##  $ surf    : num  2.58 3.17 2.25 2.25 2.83 ...
##  $ Points  : int  25 12 24 10 22 21 21 31 24 26 ...
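As a side note, printing all 166 rows as above is rarely necessary; a more compact sketch of the same exploration (using only functions already loaded here) would be:

```r
# Preview only the first six rows instead of the whole data frame
head(Students2014)
# dplyr's glimpse() gives a compact one-line-per-variable overview
glimpse(Students2014)
```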

Second, we display the data graphically.

Overview:

plot(Students2014)

The same without the variable ‘gender’:

pairs(Students2014[-1])

Last, a ggpairs graphic for a possibly clearer display:

ggpairs(Students2014, mapping = aes(col = gender, alpha = 0.3), lower = list(combo = wrap("facethist", bins = 20)))

Next we have summaries of the included variables and scatterplots to clarify their influence on the ‘Points’ variable:

library(ggplot2)
qplot(Attitude, Points, data = Students2014) + geom_smooth(method = "lm")

qplot(Age, Points, data = Students2014) + geom_smooth(method = "lm")

qplot(gender, Points, data = Students2014) + geom_smooth(method = "lm")

qplot(deep, Points, data = Students2014) + geom_smooth(method = "lm")

qplot(stra, Points, data = Students2014) + geom_smooth(method = "lm")

qplot(surf, Points, data = Students2014) + geom_smooth(method = "lm")

Next, some more illustration of the effects of the other variables on the ‘Points’ variable, with summaries of these.

  1. In the following simple regression models, Points is used as the target variable while the other variables are used one at a time as explanatory variables. A summary is provided for each individual regression.
M_Attitude <- lm(Points ~ Attitude, data = Students2014)
summary(M_Attitude)
## 
## Call:
## lm(formula = Points ~ Attitude, data = Students2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.9763  -3.2119   0.4339   4.1534  10.6645 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 11.63715    1.83035   6.358 1.95e-09 ***
## Attitude     0.35255    0.05674   6.214 4.12e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared:  0.1906, Adjusted R-squared:  0.1856 
## F-statistic: 38.61 on 1 and 164 DF,  p-value: 4.119e-09
M_Age <- lm(Points ~ Age, data = Students2014)
summary(M_Age)
## 
## Call:
## lm(formula = Points ~ Age, data = Students2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.0360  -3.7531   0.0958   4.6762  10.8128 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 24.52150    1.57339  15.585   <2e-16 ***
## Age         -0.07074    0.05901  -1.199    0.232    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.887 on 164 degrees of freedom
## Multiple R-squared:  0.008684,   Adjusted R-squared:  0.00264 
## F-statistic: 1.437 on 1 and 164 DF,  p-value: 0.2324
M_gender <- lm(Points ~ gender, data = Students2014)
summary(M_gender)
## 
## Call:
## lm(formula = Points ~ gender, data = Students2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -15.3273  -3.3273   0.5179   4.5179  10.6727 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  22.3273     0.5613  39.776   <2e-16 ***
## genderM       1.1549     0.9664   1.195    0.234    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.887 on 164 degrees of freedom
## Multiple R-squared:  0.008632,   Adjusted R-squared:  0.002587 
## F-statistic: 1.428 on 1 and 164 DF,  p-value: 0.2338
M_deep <- lm(Points ~ deep, data = Students2014)
summary(M_deep)
## 
## Call:
## lm(formula = Points ~ deep, data = Students2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -15.6913  -3.6935   0.2862   4.9957  10.3537 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  23.1141     3.0908   7.478 4.31e-12 ***
## deep         -0.1080     0.8306  -0.130    0.897    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.913 on 164 degrees of freedom
## Multiple R-squared:  0.000103,   Adjusted R-squared:  -0.005994 
## F-statistic: 0.01689 on 1 and 164 DF,  p-value: 0.8967
M_stra <- lm(Points ~ stra, data = Students2014)
summary(M_stra)
## 
## Call:
## lm(formula = Points ~ stra, data = Students2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.5581  -3.8198   0.1042   4.3024  10.1394 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   19.233      1.897  10.141   <2e-16 ***
## stra           1.116      0.590   1.892   0.0603 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.849 on 164 degrees of freedom
## Multiple R-squared:  0.02135,    Adjusted R-squared:  0.01538 
## F-statistic: 3.578 on 1 and 164 DF,  p-value: 0.06031
M_surf <- lm(Points ~ surf, data = Students2014)
summary(M_surf)
## 
## Call:
## lm(formula = Points ~ surf, data = Students2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14.6539  -3.3744   0.3574   4.4734  10.2234 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  27.2017     2.4432  11.134   <2e-16 ***
## surf         -1.6091     0.8613  -1.868   0.0635 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.851 on 164 degrees of freedom
## Multiple R-squared:  0.02084,    Adjusted R-squared:  0.01487 
## F-statistic:  3.49 on 1 and 164 DF,  p-value: 0.06351

Looking at the output of the summaries, the most trustworthy explanatory variable is Attitude (p ≈ 4e-09), while stra (p ≈ 0.060) and surf (p ≈ 0.064) are borderline; Age, gender and deep show no significant association with Points. This matches the scatterplots of the individual dependences of Points on the other variables. For Attitude in particular the assumption of an influence on Points is reasonable, with a quite low margin for failure. (Failure in this case means drawing a random sample in which the estimated relationship fails to reflect or describe the true dependence between the variables.)
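The visual impressions can also be checked numerically. As a sketch (not part of the original analysis), the pairwise correlations of the numeric variables with Points can be computed directly:

```r
# Correlations of each numeric variable with exam points;
# gender is excluded because cor() requires numeric input
num_vars <- Students2014[, sapply(Students2014, is.numeric)]
cor(num_vars)[, "Points"]
```

The column of correlations with Points should point to the same candidates as the simple regressions above.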

The mean, standard deviation and variance of the dataset’s variables:

sapply(Students2014, mean, na.rm = TRUE)
## Warning in mean.default(X[[i]], ...): argument is not numeric or logical:
## returning NA
##    gender       Age  Attitude      deep      stra      surf    Points 
##        NA 25.512048 31.427711  3.679719  3.121235  2.787149 22.716867
sapply(Students2014, sd, na.rm = TRUE)
## Warning in var(if (is.vector(x) || is.factor(x)) x else as.double(x), na.rm = na.rm): Calling var(x) on a factor x is deprecated and will become an error.
##   Use something like 'all(duplicated(x)[-1L])' to test for a constant vector.
##    gender       Age  Attitude      deep      stra      surf    Points 
## 0.4742358 7.7660785 7.2990794 0.5541369 0.7718318 0.5288405 5.8948836
sapply(Students2014, var, na.rm = TRUE)
## Warning in FUN(X[[i]], ...): Calling var(x) on a factor x is deprecated and will become an error.
##   Use something like 'all(duplicated(x)[-1L])' to test for a constant vector.
##     gender        Age   Attitude       deep       stra       surf 
##  0.2248996 60.3119752 53.2765608  0.3070677  0.5957244  0.2796722 
##     Points 
## 34.7496532
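The warnings above come from applying mean(), sd() and var() to the factor gender. A sketch that avoids them by restricting the summaries to the numeric columns:

```r
# Keep only numeric columns so the summaries run without warnings
num_vars <- Students2014[, sapply(Students2014, is.numeric)]
sapply(num_vars, mean)
sapply(num_vars, sd)
sapply(num_vars, var)
```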

Below is a regression model where exam points (Points) is the target/dependent variable, with three explanatory variables: Attitude, Age and stra. These were chosen because the plots and simple regressions above suggest they are associated with the variable we are attempting to explain. At the same time the model is drawn in several different ways to help interpret and understand its relevance.

Model3 <- lm(formula = Points ~ Attitude + Age + stra, data = Students2014)
plot(Model3)

The following function calls work in my R project, but for some reason they refused to knit to HTML, even though I took measures to enable the knitting of error terms etc. I used them to make some plots that I discuss later in this exercise; they were not individually necessary for chapter 2, and a collection of the same plots is also produced in another way below. Still, I wanted to post the calls here to show what I did. Readers please note: these were not required as such, and if you want to experiment with them, I recommend copying them into an R script together with all the other necessary elements.

r.squared(Model3, model = NULL, type = c("Attitude", "Age", "stra"), dfcor = TRUE)  # Normal Q-Q
r.squared(Model3, model = NULL, type = c("Attitude", "Age", "stra"), dfcor = FALSE)
r.squared(Model3, model = "lm", type = c("Attitude", "Age", "stra"), dfcor = TRUE)  # Res vs Lev
r.squared(Model3, model = "lm", type = c("Attitude", "Age", "stra"), dfcor = FALSE) # Res vs Fit

  1. Finally, a summary of the model:
summary(Model3)
## 
## Call:
## lm(formula = Points ~ Attitude + Age + stra, data = Students2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -18.1149  -3.2003   0.3303   3.4129  10.7599 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 10.89543    2.64834   4.114 6.17e-05 ***
## Attitude     0.34808    0.05622   6.191 4.72e-09 ***
## Age         -0.08822    0.05302  -1.664   0.0981 .  
## stra         1.00371    0.53434   1.878   0.0621 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.26 on 162 degrees of freedom
## Multiple R-squared:  0.2182, Adjusted R-squared:  0.2037 
## F-statistic: 15.07 on 3 and 162 DF,  p-value: 1.07e-08

The summary provides seemingly significant results: Attitude in particular shows a clear association with Points, while the contributions of Age and stra are weaker. The residual standard error has remained quite low relative to the number of variables used. The R-squared describes, roughly, how close the data lie to the regression line. The multiple R-squared of the model (about 0.22) is quite modest, so a large share of the variation in Points remains unexplained; this does not deny the validity of the regression, but it limits its explanatory power. The variants of the r.squared calls above produced the Residuals vs Fitted, Residuals vs Leverage and Normal Q-Q plots, as well as the Scale-Location plot.
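Rather than the r.squared() calls above, the R-squared values can be read directly from the model summary object; a minimal sketch:

```r
s <- summary(Model3)
s$r.squared      # multiple R-squared, about 0.218 for this model
s$adj.r.squared  # adjusted R-squared, about 0.204
```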

  1. Next, three diagnostic plots are combined to help assess the validity of the model:
par(mfrow = c(2,2))
plot(Model3, which = c(1,2,5))

Linear regression models rest on a few general assumptions: 1. Linearity. 2. The errors of the model are normally distributed. 3. The errors are not correlated. 4. The sizes of the errors do not depend on the variables used to explain the target variable.

Now let’s consider how the plots produced correspond (or not) to these assumptions:

The Q–Q plot shows how well the standardised residuals of the model follow the theoretical normal quantiles. Here the points fall reasonably close to the line, so the normal distribution assumption seems to hold for the model.

The residuals vs. fitted values plot does not show any regular pattern, suggesting that the errors are not correlated with the explanatory variables and that their size does not depend on them.

On this evidence, all of the assumptions appear reasonably valid for the model created.
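If a numerical check is wanted alongside the plots, the normality of the residuals could also be tested, for example with a Shapiro–Wilk test (a sketch, not part of the original output):

```r
# Shapiro-Wilk test of the model residuals:
# a small p-value would speak against the normality assumption
shapiro.test(residuals(Model3))
```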